Abstract: We consider the problem of finding a control
policy for a Markov Decision Process (MDP) to maximize the 
probability of reaching some states while avoiding some other 
states. This problem is motivated by applications in robotics, 
where such problems naturally arise when probabilistic models 
of robot motion are required to satisfy temporal logic task 
specifications. We transform this problem into a Stochastic 
Shortest Path (SSP) problem and develop a new approximate 
dynamic programming algorithm to solve it. This algorithm 
is of the actor-critic type and uses a least squares temporal
difference learning method. It operates on sample paths of
the system and optimizes the policy within a pre-specified 
class parameterized by a parsimonious set of parameters. We 
show its convergence to a policy corresponding to a stationary 
point in the parameters' space. Simulation results confirm the 
effectiveness of the proposed solution. 

Index Terms: Markov Decision Processes, dynamic programming,
actor-critic methods, robot motion control, robotics.



I. Introduction 

Markov Decision Processes (MDPs) have been widely 
used in a variety of application domains. In particular, 
they have been increasingly used to model and control 
autonomous agents subject to noise in their sensing and
actuation, or uncertainty in the environment in which they operate.
Examples include: unmanned aircraft [1], ground robots [2],
and steering of medical needles [3]. In these studies, the
underlying motion of the system cannot be predicted with
certainty, but its statistics can be obtained from the sensing and
actuation models through a simulator or empirical trials,
providing transition probabilities.

Recently, the problem of controlling an MDP from a 
temporal logic specification has received a lot of attention. 
Temporal logics such as Linear Temporal Logic (LTL) and 
Computation Tree Logic (CTL) are appealing as they
provide formal, high-level languages in which the behavior
of the system can be specified (see [4]). In the context
of MDPs, providing probabilistic guarantees means finding
optimal policies that maximize the probabilities of satisfying
these specifications. In [2], [5], it has been shown that the
problem of finding an optimal policy that maximizes the
probability of satisfying a temporal logic formula can be
naturally translated to one of maximizing the probability
of reaching a set of states in the MDP. Such problems
are referred to as Maximal Reachability Probability (MRP)
problems. It is known [3] that they are equivalent
to Stochastic Shortest Path (SSP) problems, which belong
to a standard class of infinite horizon problems in dynamic
programming.

* Research partially supported by the NSF under grant EFRI-0735974,
by the DOE under grant DE-FG52-06NA27490, by the ODDR&E MURI10
program under grant N00014-10-1-0952, and by ONR MURI under grant
N00014-09-1051.

† Reza Moazzez Estanjini and Jing Wang are with the Division of
Systems Eng., Boston University, 8 St. Mary's St., Boston, MA 02215,
email: {reza, wangjing}@bu.edu.

‡ Xu Chu Ding, Morteza Lahijanian, and Calin A. Belta are with the
Dept. of Mechanical Eng., Boston University, 15 St. Mary's St., Boston,
MA 02215, email: {xcding, morteza, cbelta}@bu.edu.

§ Ioannis Ch. Paschalidis is with the Dept. of Electrical & Computer
Eng., and the Division of Systems Eng., Boston University, 8 St. Mary's
St., Boston, MA 02215, email: yannisp@bu.edu.

§ Corresponding author

However, as suggested in [2], [5], these problems usually 
involve MDPs with large state spaces. For example, in order 
to synthesize an optimal policy for an MDP satisfying an 
LTL formula, one needs to solve an MRP problem on a 
much larger MDP, which is the product of the original MDP 
and an automaton representing the formula. Thus, computing 
the exact solution can be computationally prohibitive for 
realistically-sized settings. Moreover, in some cases, the 
system of interest is so complex that it is not feasible to 
determine transition probabilities for all actions and states 
explicitly. 

Motivated by these limitations, in this paper we develop a 
new approximate dynamic programming algorithm to solve 
SSP MDPs and we establish its convergence. The algorithm 
is of the actor-critic type and uses a Least Squares Temporal
Difference (LSTD) learning method. Our proposed algorithm
is based on sample paths, and thus only requires transition 
probabilities along the sampled paths and not over the entire 
state space. 

Actor-critic algorithms are typically used to optimize some 
Randomized Stationary Policy (RSP) using policy gradient 
estimation. RSPs are parameterized by a parsimonious set 
of parameters and the objective is to optimize the policy 
with respect to these parameters. To this end, one needs to 
estimate appropriate policy gradients, which can be done 
using learning methods that are much more efficient than 
computing a cost-to-go function over the entire state-action 
space. Many different versions of actor-critic algorithms have 
been proposed which have been shown to be quite efficient 
for various applications (e.g., in robotics [6] and navigation 
[7], power management of wireless transmitters [8], biology 
[9], and optimal bidding for electricity generation companies 
[10]). 

A particularly attractive design of the actor-critic archi- 
tecture was proposed in [11], where the critic estimates 



the policy gradient using sequential observations from a 
sample path while the actor updates the policy at the same 
time, although at a slower time-scale. It was proved that 
the estimate of the critic tracks the slowly varying policy 
asymptotically under suitable conditions. A centerpiece of
these conditions is a relationship between the actor step-size 
and the critic step-size, which will be discussed later. 

The critic of [11] uses first-order variants of the Temporal
Difference (TD) algorithm (TD(1) and TD($\lambda$)). However, it
has been shown that the least squares methods, LSTD (Least
Squares TD) and LSPE (Least Squares Policy Evaluation),
are superior in terms of convergence rate (see [12], [13]).
LSTD and LSPE were first proposed for discounted cost
problems in [12] and [14], respectively. Later, [13] showed
that the convergence rate of LSTD is optimal. Their results
clearly demonstrated that LSTD converges much faster and
more reliably than TD(1) and TD($\lambda$).

Motivated by these findings, we propose an actor-critic 
algorithm that adopts LSTD learning methods tailored to SSP
problems, while at the same time maintaining the concurrent
update architecture of the actor and the critic. (Note that [15]
also used LSTD in an actor-critic method, but the actor had 
to wait for the critic to converge before making each policy 
update.) To illustrate salient features of the approach, we 
present a case study where a robot in a large environment is 
required to satisfy a task specification of "go to a set of goal 
states while avoiding a set of unsafe states." (We note that 
more complex task specifications can be directly converted 
to MRP problems as shown in [2], [5].) 

The rest of the paper is organized as follows. We formulate
the problem in Sec. II. The LSTD actor-critic algorithm
with concurrent updates is presented in Sec. III, where the
convergence of the algorithm is shown. A case study is
presented in Sec. V. We conclude the paper in Sec. VI.

Notation: We use bold letters to denote vectors and 
matrices; typically vectors are lower case and matrices upper 
case. Vectors are assumed to be column vectors unless 
explicitly stated otherwise. Transpose is denoted by prime. 
For any $m \times n$ matrix $\mathbf{A}$, with rows $\mathbf{a}_1, \ldots, \mathbf{a}_m \in \mathbb{R}^n$, $v(\mathbf{A})$
denotes the column vector $(\mathbf{a}_1, \ldots, \mathbf{a}_m)$. $\|\cdot\|$ stands for the
Euclidean norm and $\|\cdot\|_{\boldsymbol\theta}$ is a special norm in the MDP state-action
space that we will define later. $\mathbf{0}$ denotes a vector or
matrix with all components set to zero and $\mathbf{I}$ is the identity
matrix. $|S|$ denotes the cardinality of a set $S$.

II. Problem Formulation 

Consider an SSP MDP with finite state and action spaces.
Let $k$ denote time, $X$ denote the state space with cardinality
$|X|$, and $U$ denote the action space with cardinality $|U|$. Let
$x_k \in X$ and $u_k \in U$ be the state of the system and the action
taken at time $k$, respectively. Let $g(x_k, u_k)$ be the one-step
cost of applying action $u_k$ while the system is at state $x_k$.
Let $x_0$ and $x^*$ denote the initial state and the special cost-free
termination state, respectively. Let $p(j \mid x_k, u_k)$ denote
the state transition probabilities (which are typically not
explicitly known); that is, $p(j \mid x_k, u_k)$ is the probability of
transition from state $x_k$ to state $j$ given that action $u_k$
is taken while the system is at state $x_k$. A policy $\mu$ is
said to be proper if, when using this policy, there is a
positive probability that $x^*$ will be reached after at most
$|X|$ transitions, regardless of the initial state $x_0$. We make
the following assumption.

Assumption A 

There exists a proper stationary policy.

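The policy candidates are assumed to belong to a parameterized
family of Randomized Stationary Policies (RSPs)
$\{\mu_{\boldsymbol\theta}(u \mid x) \mid \boldsymbol\theta \in \mathbb{R}^n\}$. That is, given a state $x \in X$
and a parameter $\boldsymbol\theta$, the policy applies action $u \in U$ with
probability $\mu_{\boldsymbol\theta}(u \mid x)$. Define the expected total cost $\bar\alpha(\boldsymbol\theta)$ to
be $\lim_{t \to \infty} E\{\sum_{k=0}^{t-1} g(x_k, u_k) \mid x_0\}$, where $u_k$ is generated
according to RSP $\mu_{\boldsymbol\theta}(u \mid x)$. The goal is to optimize the
expected total cost $\bar\alpha(\boldsymbol\theta)$ over the $n$-dimensional vector $\boldsymbol\theta$.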

With no explicit model of the state transitions but only
a sample path denoted by $\{x_k, u_k\}$, the actor-critic algorithms
typically optimize locally in the following way:
first, the critic estimates the policy gradient $\nabla\bar\alpha(\boldsymbol\theta)$ using
a Temporal Difference (TD) algorithm; then the actor
modifies the policy parameter along the gradient direction.
Let the operator $P_{\boldsymbol\theta}$ denote taking expectation after
one transition. More precisely, for a function $f(x, u)$,

\[ (P_{\boldsymbol\theta} f)(x, u) = \sum_{j \in X,\ \nu \in U} \mu_{\boldsymbol\theta}(\nu \mid j)\, p(j \mid x, u)\, f(j, \nu). \]

Define the $Q_{\boldsymbol\theta}$-value function to be any function satisfying the Poisson
equation

\[ Q_{\boldsymbol\theta}(x, u) = g(x, u) + (P_{\boldsymbol\theta} Q_{\boldsymbol\theta})(x, u), \]

where $Q_{\boldsymbol\theta}(x, u)$ can be interpreted as the expected future
cost we incur if we start at state $x$, apply control $u$, and
then apply RSP $\mu_{\boldsymbol\theta}$. We note that, in general, the Poisson
equation need not hold for SSP; however, it holds if the policy
corresponding to RSP $\mu_{\boldsymbol\theta}$ is a proper policy [16]. We make
the following assumption.

Assumption B 

For any $\boldsymbol\theta$, and for any $x \in X$, $\mu_{\boldsymbol\theta}(u \mid x) > 0$ if action $u$ is
feasible at state $x$, and $\mu_{\boldsymbol\theta}(u \mid x) \equiv 0$ otherwise.

We note that one possible RSP for which Assumption B
holds is the "Boltzmann" policy (see [17]), that is

\[ \mu_{\boldsymbol\theta}(u \mid x) = \frac{\exp(h^{(u)}_{\boldsymbol\theta}(x))}{\sum_{a \in U} \exp(h^{(a)}_{\boldsymbol\theta}(x))}, \tag{1} \]

where $h^{(u)}_{\boldsymbol\theta}(x)$ is a function that corresponds to action $u$ and
is parameterized by $\boldsymbol\theta$. The Boltzmann policy is simple to
use and is the policy that will be used in the case study in
Sec. V.
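
A Boltzmann policy turns per-action scores into action probabilities by exponentiating and normalizing them. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `boltzmann_policy` and the example scores are hypothetical, and the scores $h^{(u)}_{\boldsymbol\theta}(x)$ are assumed to be already evaluated for the feasible actions at the current state.

```python
import math

def boltzmann_policy(h_values):
    """Map per-action scores h^{(u)}(x) to action probabilities."""
    # Subtracting the maximum leaves the ratios unchanged but avoids overflow.
    m = max(h_values)
    weights = [math.exp(h - m) for h in h_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for three feasible actions at one state.
probs = boltzmann_policy([1.0, 2.0, 0.5])
```

Every feasible action receives a strictly positive probability, which is the property that Assumption B asks for.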

Lemma II.1 If Assumptions A and B hold, then for any $\boldsymbol\theta$
the policy corresponding to RSP $\mu_{\boldsymbol\theta}$ is proper.

Proof: The proof follows from the definition of a proper
policy. ■
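
Under suitable ergodicity conditions, $\{x_k\}$ and $\{x_k, u_k\}$
are Markov chains with stationary distributions under a fixed
policy. These stationary distributions are denoted by $\pi_{\boldsymbol\theta}(x)$
and $\eta_{\boldsymbol\theta}(x, u)$, respectively. We will not elaborate on the
ergodicity conditions, except to note that it suffices that
the process $\{x_k\}$ is irreducible and aperiodic given any
$\boldsymbol\theta$, and Assumption B holds. Denote by $\mathbf{Q}_{\boldsymbol\theta}$ the $(|X||U|)$-dimensional
vector $\mathbf{Q}_{\boldsymbol\theta} = (Q_{\boldsymbol\theta}(x, u);\ \forall x \in X, u \in U)$. Let
now

\[ \boldsymbol\psi_{\boldsymbol\theta}(x, u) = \nabla_{\boldsymbol\theta} \ln \mu_{\boldsymbol\theta}(u \mid x), \]

where $\boldsymbol\psi_{\boldsymbol\theta}(x, u) = \mathbf{0}$ when $x, u$ are such that $\mu_{\boldsymbol\theta}(u \mid x) \equiv 0$
for all $\boldsymbol\theta$. It is assumed that $\boldsymbol\psi_{\boldsymbol\theta}(x, u)$ is bounded
and continuously differentiable. We write $\boldsymbol\psi_{\boldsymbol\theta}(x, u) = (\psi^1_{\boldsymbol\theta}(x, u), \ldots, \psi^n_{\boldsymbol\theta}(x, u))$, where $n$ is the dimensionality of $\boldsymbol\theta$.
As we did in defining $\mathbf{Q}_{\boldsymbol\theta}$, we will denote by $\boldsymbol\psi^i_{\boldsymbol\theta}$ the $(|X||U|)$-dimensional
vector $\boldsymbol\psi^i_{\boldsymbol\theta} = (\psi^i_{\boldsymbol\theta}(x, u);\ \forall x \in X, u \in U)$.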

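A key fact underlying the actor-critic algorithm is that the
policy gradient can be expressed as (Theorem 2.15 in [13])

\[ \frac{\partial \bar\alpha(\boldsymbol\theta)}{\partial \theta_i} = \langle \mathbf{Q}_{\boldsymbol\theta}, \boldsymbol\psi^i_{\boldsymbol\theta} \rangle_{\boldsymbol\theta}, \quad i = 1, \ldots, n, \]

where for any two functions $f_1$ and $f_2$ of $x$ and $u$, expressed
as $(|X||U|)$-dimensional vectors $\mathbf{f}_1$ and $\mathbf{f}_2$, we define

\[ \langle \mathbf{f}_1, \mathbf{f}_2 \rangle_{\boldsymbol\theta} = \sum_{x, u} \eta_{\boldsymbol\theta}(x, u)\, f_1(x, u)\, f_2(x, u). \tag{2} \]

Let $\|\cdot\|_{\boldsymbol\theta}$ denote the norm induced by the inner product (2),
i.e., $\|\mathbf{f}\|^2_{\boldsymbol\theta} = \langle \mathbf{f}, \mathbf{f} \rangle_{\boldsymbol\theta}$. Let also $\mathcal{S}_{\boldsymbol\theta}$ be the subspace of $\mathbb{R}^{|X||U|}$
spanned by the vectors $\boldsymbol\psi^i_{\boldsymbol\theta}$, $i = 1, \ldots, n$, and denote by $\Pi_{\boldsymbol\theta}$
the projection with respect to the norm $\|\cdot\|_{\boldsymbol\theta}$ onto $\mathcal{S}_{\boldsymbol\theta}$, i.e.,
for any $\mathbf{f} \in \mathbb{R}^{|X||U|}$, $\Pi_{\boldsymbol\theta}\mathbf{f}$ is the unique vector in $\mathcal{S}_{\boldsymbol\theta}$ that
minimizes $\|\mathbf{f} - \hat{\mathbf{f}}\|_{\boldsymbol\theta}$ over all $\hat{\mathbf{f}} \in \mathcal{S}_{\boldsymbol\theta}$. Since for all $i$

\[ \langle \mathbf{Q}_{\boldsymbol\theta}, \boldsymbol\psi^i_{\boldsymbol\theta} \rangle_{\boldsymbol\theta} = \langle \Pi_{\boldsymbol\theta}\mathbf{Q}_{\boldsymbol\theta}, \boldsymbol\psi^i_{\boldsymbol\theta} \rangle_{\boldsymbol\theta}, \]

it is sufficient to know the projection of $\mathbf{Q}_{\boldsymbol\theta}$ onto $\mathcal{S}_{\boldsymbol\theta}$ in
order to compute $\nabla\bar\alpha(\boldsymbol\theta)$. One possibility is to approximate
$Q_{\boldsymbol\theta}$ with a parametric linear architecture of the following
form (see [11]):

\[ Q^{\mathbf{r}^*}_{\boldsymbol\theta}(x, u) = \boldsymbol\psi'_{\boldsymbol\theta}(x, u)\, \mathbf{r}^*, \quad \mathbf{r}^* \in \mathbb{R}^n. \tag{3} \]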



This dramatically reduces the complexity of learning from
the space $\mathbb{R}^{|X||U|}$ to the space $\mathbb{R}^n$. Furthermore, the temporal
difference algorithms can be used to learn such an $\mathbf{r}^*$
effectively. The elements of $\boldsymbol\psi_{\boldsymbol\theta}(x, u)$ are understood as
features associated with an $(x, u)$ state-action pair in the
sense of basis functions used to develop an approximation
of the $Q_{\boldsymbol\theta}$-value function.
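
To make the feature vector concrete, consider a Boltzmann policy whose scores are linear in the parameter, $h^{(u)}_{\boldsymbol\theta}(x) = \boldsymbol\theta'\boldsymbol\phi(x, u)$. This linear form is an assumption made only for illustration; the text does not require it. In that case the score function has the closed form $\boldsymbol\psi_{\boldsymbol\theta}(x, u) = \boldsymbol\phi(x, u) - \sum_a \mu_{\boldsymbol\theta}(a \mid x)\boldsymbol\phi(x, a)$. The names `softmax`, `score_features`, and the feature values below are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    t = sum(w)
    return [v / t for v in w]

def score_features(theta, phi, action):
    """psi_theta(x, u) = grad_theta ln mu_theta(u | x) for a linear Boltzmann policy.

    phi[a] is the (hypothetical) feature vector of action a at the current state.
    """
    scores = [sum(t * f for t, f in zip(theta, phi_a)) for phi_a in phi]
    mu = softmax(scores)
    n = len(theta)
    # Policy-weighted average of the action features.
    mean_phi = [sum(mu[a] * phi[a][i] for a in range(len(phi))) for i in range(n)]
    return [phi[action][i] - mean_phi[i] for i in range(n)]

theta = [0.3, -0.2]
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = softmax([sum(t * f for t, f in zip(theta, p)) for p in phi])
psi = [score_features(theta, phi, a) for a in range(3)]
```

A useful sanity check is that the features average to zero under the policy, since $\sum_u \mu_{\boldsymbol\theta}(u \mid x)\nabla_{\boldsymbol\theta}\ln\mu_{\boldsymbol\theta}(u \mid x) = \nabla_{\boldsymbol\theta}\sum_u \mu_{\boldsymbol\theta}(u \mid x) = \mathbf{0}$.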

III. Actor-Critic Algorithm Using LSTD 

The critic in [11] used either TD($\lambda$) or TD(1). The
algorithm we propose uses least squares TD methods (LSTD
in particular) instead, as they have been shown to provide
far superior performance. In the sequel, we first describe
the LSTD actor-critic algorithm and then we prove its 
convergence. 



A. The Algorithm 

The algorithm uses a sequence of simulated trajectories,
each of which starts at a given $x_0$ and ends as soon
as $x^*$ is visited for the first time in the sequence. Once a
trajectory is completed, the state of the system is reset to the
initial state $x_0$ and the process is repeated.

Let $x_k$ denote the state of the system at time $k$. Let $\mathbf{r}_k$,
the iterate for $\mathbf{r}^*$ in (3), be the parameter vector of the critic
at time $k$, $\boldsymbol\theta_k$ be the parameter vector of the actor at time
$k$, and $x_{k+1}$ be the new state, obtained after action $u_k$ is
applied when the state is $x_k$. A new action $u_{k+1}$ is generated
according to the RSP corresponding to the actor parameter
$\boldsymbol\theta_k$ (see [11]). The critic and the actor carry out the following
updates, where $\mathbf{z}_k \in \mathbb{R}^n$ represents Sutton's eligibility trace
[17], $\mathbf{b}_k \in \mathbb{R}^n$ refers to a statistical estimate of the single
period reward, and $\mathbf{A}_k \in \mathbb{R}^{n \times n}$ is a sample estimate of
the matrix formed by $\mathbf{z}_k(\boldsymbol\psi'_{\boldsymbol\theta_k}(x_{k+1}, u_{k+1}) - \boldsymbol\psi'_{\boldsymbol\theta_k}(x_k, u_k))$,
which can be viewed as a sample observation of the scaled
difference of the observation of the state incidence vector
for iterations $k$ and $k + 1$, scaled to the feature space by the
basis functions.

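LSTD Actor-Critic for SSP

Initialization:

Set all entries in $\mathbf{z}_0$, $\mathbf{A}_0$, $\mathbf{b}_0$ and $\mathbf{r}_0$ to zeros. Let $\boldsymbol\theta_0$ take
some initial value, potentially corresponding to a heuristic
policy.

Critic:

\[
\begin{aligned}
\mathbf{z}_{k+1} &= \lambda \mathbf{z}_k + \boldsymbol\psi_{\boldsymbol\theta_k}(x_k, u_k), \\
\mathbf{b}_{k+1} &= \mathbf{b}_k + \gamma_k \left[ g(x_k, u_k)\mathbf{z}_k - \mathbf{b}_k \right], \\
\mathbf{A}_{k+1} &= \mathbf{A}_k + \gamma_k \left[ \mathbf{z}_k \left( \boldsymbol\psi'_{\boldsymbol\theta_k}(x_{k+1}, u_{k+1}) - \boldsymbol\psi'_{\boldsymbol\theta_k}(x_k, u_k) \right) - \mathbf{A}_k \right],
\end{aligned} \tag{4}
\]

where $\lambda \in [0, 1)$, $\gamma_k = 1/k$, and finally

\[ \mathbf{r}_{k+1} = -\mathbf{A}_k^{-1}\mathbf{b}_k. \tag{5} \]

Actor:

\[ \boldsymbol\theta_{k+1} = \boldsymbol\theta_k - \beta_k \Gamma(\mathbf{r}_k)\, \mathbf{r}'_k \boldsymbol\psi_{\boldsymbol\theta_k}(x_{k+1}, u_{k+1})\, \boldsymbol\psi_{\boldsymbol\theta_k}(x_{k+1}, u_{k+1}). \tag{6} \]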

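In the above, $\{\gamma_k\}$ controls the critic step-size, while $\{\beta_k\}$
and $\Gamma(\mathbf{r})$ control the actor step-size together. An implementation
of this algorithm needs to make these choices. The
role of $\Gamma(\mathbf{r})$ is mainly to keep the actor updates bounded,
and we can for instance use

\[ \Gamma(\mathbf{r}) = \begin{cases} D / \|\mathbf{r}\|, & \text{if } \|\mathbf{r}\| > D, \\ 1, & \text{otherwise}, \end{cases} \]

for some $D > 0$. $\{\beta_k\}$ is a deterministic and non-increasing
sequence for which we need to have

\[ \sum_k \beta_k = \infty, \quad \sum_k \beta_k^2 < \infty, \quad \lim_{k \to \infty} \frac{\beta_k}{\gamma_k} = 0. \tag{7} \]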



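An example of $\{\beta_k\}$ satisfying Eq. (7) is

\[ \beta_k = \frac{c}{k \ln k}, \quad k > 1, \tag{8} \]

where $c > 0$ is a constant parameter. Also, $\boldsymbol\psi_{\boldsymbol\theta}(x, u)$ is
defined as

\[ \boldsymbol\psi_{\boldsymbol\theta}(x, u) = \nabla_{\boldsymbol\theta} \ln \mu_{\boldsymbol\theta}(u \mid x), \]

where $\boldsymbol\psi_{\boldsymbol\theta}(x, u) = \mathbf{0}$ when $x, u$ are such that $\mu_{\boldsymbol\theta}(u \mid x) \equiv 0$
for all $\boldsymbol\theta$. It is assumed that $\boldsymbol\psi_{\boldsymbol\theta}(x, u)$ is bounded
and continuously differentiable. Note that $\boldsymbol\psi_{\boldsymbol\theta}(x, u) = (\psi^1_{\boldsymbol\theta}(x, u), \ldots, \psi^n_{\boldsymbol\theta}(x, u))$, where $n$ is the dimensionality of $\boldsymbol\theta$.
The convergence of the algorithm is stated in the following
Theorem (see the Appendix for the proof).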

Theorem III.1 [Actor Convergence] For the LSTD actor-critic
with some step-size sequence $\{\beta_k\}$ satisfying (7), for
any $\epsilon > 0$, there exists some $\lambda$ sufficiently close to 1, such
that $\liminf_k \|\nabla\bar\alpha(\boldsymbol\theta_k)\| < \epsilon$ w.p.1. That is, $\boldsymbol\theta_k$ visits an
arbitrary neighborhood of a stationary point infinitely often.
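
One iteration of the critic and actor updates of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the critic obtains its parameter by solving $\mathbf{A}_k \mathbf{r} = -\mathbf{b}_k$ (the usual LSTD least-squares step), it keeps the previous $\mathbf{r}$ whenever $\mathbf{A}_k$ is singular (as it is at the all-zero initialization), and it uses plain Python lists. The helper names `solve`, `clip_factor`, `critic_step`, and `actor_step` are hypothetical.

```python
import math

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting; None if singular."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        if abs(M[p][c]) < 1e-12:
            return None
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def clip_factor(r, D):
    """Gamma(r): D / ||r|| if ||r|| > D, else 1."""
    norm = math.sqrt(sum(v * v for v in r))
    return D / norm if norm > D else 1.0

def critic_step(z, b, A, r, psi_k, psi_next, g, lam, gamma):
    """One critic update; every new quantity is computed from the time-k values."""
    n = len(z)
    # Assumed LSTD solve r = -A^{-1} b; fall back to the old r if A is singular.
    sol = solve(A, [-v for v in b])
    r_new = sol if sol is not None else r
    b_new = [b[i] + gamma * (g * z[i] - b[i]) for i in range(n)]
    A_new = [[A[i][j] + gamma * (z[i] * (psi_next[j] - psi_k[j]) - A[i][j])
              for j in range(n)] for i in range(n)]
    z_new = [lam * z[i] + psi_k[i] for i in range(n)]
    return z_new, b_new, A_new, r_new

def actor_step(theta, r, psi_next, beta, D):
    """theta - beta * Gamma(r) * (r' psi) * psi, with psi evaluated at the new pair."""
    q = sum(ri * pi for ri, pi in zip(r, psi_next))
    scale = beta * clip_factor(r, D) * q
    return [t - scale * p for t, p in zip(theta, psi_next)]
```

In a full run, `gamma` would follow $\gamma_k = 1/k$ and `beta` a sequence that decays faster than $\gamma_k$, so that the critic effectively sees a slowly varying policy.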

IV. The MRP and its Conversion into an SSP Problem

In the MRP problem, we assume that there is a set of 
unsafe states which are set to be absorbing on the MDP 
(i.e., there is only one control at each state, corresponding to 
a self-transition with probability 1). Let $X_G$ and $X_U$ denote
the set of goal states and unsafe states, respectively. A safe
state is a state that is not unsafe. It is assumed that if the
system is at a safe state, then there is at least one sequence of
actions that can reach one of the states in $X_G$ with positive
probability. Note that this implies that Assumption A holds.
In the MRP, the goal is to find the optimal policy that
maximizes the probability of reaching a state in $X_G$ from
a given initial state. Note that since the unsafe states are 
absorbing, to satisfy this specification the system must not 
visit the unsafe states. 

We now convert the MRP problem into an SSP problem,
which requires us to change the original MDP (now denoted
as MDP$_M$) into an SSP MDP (denoted as MDP$_S$). Note that
[3] established the equivalence between an MRP problem and
an SSP problem where the expected reward is maximized.
Here we present a different transformation where an MRP
problem is converted to a more standard SSP problem where
the expected cost is minimized.

To begin, we denote the state space of MDP$_M$ by $X_M$, and
define $X_S$, the state space of MDP$_S$, to be

\[ X_S = (X_M \setminus X_G) \cup \{x^*\}, \]

where $x^*$ denotes a special termination state. Let $x_0$ denote
the initial state, and $U$ denote the action space of MDP$_M$.
We define the action space of MDP$_S$ to be $U$, i.e., the same
as for MDP$_M$.

Let $p_M(j \mid x, u)$ denote the probability of transition to state
$j \in X_M$ if action $u$ is taken at state $x \in X_M$. We now define
the transition probability $p_S(j \mid x, u)$ for all states $x, j \in X_S$
as:

\[ p_S(j \mid x, u) = \begin{cases} \sum_{i \in X_G} p_M(i \mid x, u), & \text{if } j = x^*, \\ p_M(j \mid x, u), & \text{if } j \in X_M \setminus X_G, \end{cases} \tag{9} \]

for all $x \in X_M \setminus (X_G \cup X_U)$ and all $u \in U$. Furthermore, we
set $p_S(x^* \mid x^*, u) = 1$ and $p_S(x_0 \mid x, u) = 1$ if $x \in X_U$, for all
$u \in U$. The transition probability of MDP$_S$ is defined to be
the same as for MDP$_M$, except that the probability of visiting
the goal states in MDP$_M$ is changed into the probability of
visiting the termination state; and the unsafe states transit to
the initial state with probability 1.
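
The construction in (9) can be sketched on a dictionary-based transition model. This is an illustrative sketch only: the representation `p_m[(state, action)] -> {next_state: probability}`, the function name `to_ssp`, the label `"x*"` for the termination state, and the toy model are all assumptions of the example.

```python
def to_ssp(p_m, goal, unsafe, x0, term="x*"):
    """Build the SSP transition model p_S from p_M following (9)."""
    actions = {u for (_, u) in p_m}
    p_s = {}
    for (x, u), dist in p_m.items():
        if x in goal or x in unsafe:
            continue  # handled separately below
        merged = {}
        for j, pr in dist.items():
            # All probability mass entering a goal state is sent to x*.
            key = term if j in goal else j
            merged[key] = merged.get(key, 0.0) + pr
        p_s[(x, u)] = merged
    for u in actions:
        p_s[(term, u)] = {term: 1.0}      # x* is absorbing
        for x in unsafe:
            p_s[(x, u)] = {x0: 1.0}       # unsafe states restart at x0
    return p_s

# Toy model: start "s", unsafe "d", goals "g1" and "g2", one action "a".
p_m = {
    ("s", "a"): {"g1": 0.25, "g2": 0.25, "d": 0.25, "s": 0.25},
    ("d", "a"): {"d": 1.0},
    ("g1", "a"): {"g1": 1.0},
    ("g2", "a"): {"g2": 1.0},
}
p_s = to_ssp(p_m, goal={"g1", "g2"}, unsafe={"d"}, x0="s")
```

Goal states disappear from the state space of the new model, and their combined incoming probability appears on the termination state.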

For all $x \in X_S$, we define the cost $g(x, u) = 1$ if $x \in X_U$,
and $g(x, u) = 0$ otherwise. Define the expected total cost
of a policy $\mu$ to be $\bar\alpha^\mu_S = \lim_{t \to \infty} E\{\sum_{k=0}^{t-1} g(x_k, u_k) \mid x_0\}$,
where actions $u_k$ are obtained according to policy $\mu$ in
MDP$_S$. Moreover, for each policy $\mu$ on MDP$_S$, we can
define a policy on MDP$_M$ to be the same as $\mu$ for all states
$x \in X_M \setminus (X_G \cup X_U)$. Since actions are irrelevant at the goal
and unsafe states in both MDPs, with slight abuse of notation
we denote both policies by $\mu$. Finally, we define the
Reachability Probability $R^\mu_M$ as the probability of reaching
one of the goal states from $x_0$ under policy $\mu$ on MDP$_M$.
The lemma below relates $R^\mu_M$ and $\bar\alpha^\mu_S$:

Lemma IV.1 For any RSP $\mu$, we have $R^\mu_M = \dfrac{1}{\bar\alpha^\mu_S + 1}$.

Proof: From the definition of $g(x, u)$, $\bar\alpha^\mu_S$ is the
expected number of times when unsafe states in $X_U$ are
visited before $x^*$ is reached. From the construction of MDP$_S$,
reaching $x^*$ in MDP$_S$ is equivalent to reaching one of the
goal states in MDP$_M$. On the other hand, for MDP$_M$, by
definition of $X_G$ and $X_U$, in the Markov chain generated by
$\mu$, the states in $X_G$ and $X_U$ are the only absorbing states, and
all other states are transient. Thus, the probability of visiting
a state in $X_U$ from $x_0$ on MDP$_M$ is $1 - R^\mu_M$, which is the
same as the probability of visiting $X_U$ for each run of MDP$_S$,
due to the construction of transition probabilities (9). We can
now consider a geometric distribution where the probability
of success is $R^\mu_M$. Because $\bar\alpha^\mu_S$ is the expected number of
times when an unsafe state in $X_U$ is visited before $x^*$ is
reached, this is the same as the expected number of failures
of Bernoulli trials (with probability of success being $R^\mu_M$)
before a success. This implies

\[ \bar\alpha^\mu_S = \frac{1 - R^\mu_M}{R^\mu_M}. \]

Rearranging completes the proof. ■
The above lemma means that $\mu$ as a solution to the SSP
problem on MDP$_S$ (minimizing $\bar\alpha^\mu_S$) corresponds to a solution
for the MRP problem on MDP$_M$ (maximizing $R^\mu_M$). Note that
the algorithm uses a sequence of simulated trajectories, each
of which starts at $x_0$ and ends as soon as $x^*$ is visited for
the first time in the sequence. Once a trajectory is completed,
the state of the system is reset to the initial state $x_0$ and the
process is repeated. Thus, the actor-critic algorithm is applied
to a modified version of the MDP$_S$ where transition to a goal
state is always followed by a transition to the initial state.
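
The relation in Lemma IV.1 can be checked numerically. The sketch below is only an illustration with a hypothetical value of the reachability probability: it sums the series for the expected number of failures before the first success of Bernoulli trials, and then inverts the relation to recover the success probability.

```python
def expected_unsafe_visits(R, terms=10_000):
    """Truncated series for E[failures before first success], success probability R."""
    return sum(n * (1.0 - R) ** n * R for n in range(terms))

R = 0.25                          # hypothetical reachability probability
a_bar = expected_unsafe_visits(R) # expected total cost, close to (1 - R) / R
R_back = 1.0 / (a_bar + 1.0)      # recovers R, as Lemma IV.1 states
```

Because the map from the expected cost to the reachability probability is strictly decreasing, minimizing one is the same as maximizing the other.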



V. Case study 

In this section we apply our algorithm to control a robot 
moving in a square-shaped mission environment, which is 
partitioned into 2500 smaller square regions (a 50 x 50 grid) 
as shown in Fig. 1. We model the motion of the robot in the
environment as an MDP: each region corresponds to a state 
of the MDP, and in each region (state), the robot can take 
the following control primitives (actions): "North", "East", 
"South", "West", which represent the directions in which the 
robot intends to move (depending on the location of a region, 
some of these actions may not be enabled, for example, in 
the lower-left corner, only actions "North" and "East" are 
enabled). These control primitives are not reliable and are 
subject to noise in actuation and possible surface roughness 
in the environment. Thus, for each motion primitive at a 
region, there is a probability that the robot enters an adjacent 
region. 



Fig. 1. View of the mission environment. The initial region is marked by
o, the goal regions by x, and the unsafe regions are shown in black. 



We label the region in the south-west corner as the
initial state. We marked the regions located at the other
three corners as the set of goal states, as shown in Fig. 1.
We assume that there is a set of unsafe states $X_U$ in the
environment (shown in black in Fig. 1). Our goal is to find
the optimal policy that maximizes the probability of reaching
a state in $X_G$ (the set of goal states) from the initial state (an
instance of an MRP problem).

A. Designing an RSP 

To apply the LSTD Actor-Critic algorithm, the key step is
to design an RSP $\mu_{\boldsymbol\theta}(u \mid x)$. In this case study, we define the
RSP to be an exponential function of two scalar parameters
$\theta_1$ and $\theta_2$. These parameters are used to provide
a balance between safety and progress from applying the
control policy.

For each pair of states $x_i, x_j \in X$, we define $d(x_i, x_j)$
as the minimum number of transitions from $x_i$ to $x_j$. We
denote $x_j \in N(x_i)$ if and only if $d(x_i, x_j) \le r_n$, where $r_n$
is a fixed integer given a priori. If $x_j \in N(x_i)$, then we say
$x_j$ is in the neighborhood of $x_i$, and $r_n$ represents the radius
of the neighborhood around each state.

For each state $x \in X$, the safety score $s(x)$ is defined as
the ratio of the safe neighboring states over all neighboring
states of $x$. To be more specific, we define

\[ s(x) = \frac{\sum_{y \in N(x)} I_s(y)}{|N(x)|}, \tag{10} \]

where $I_s(y)$ is an indicator function such that $I_s(y) = 1$
if and only if $y \in X \setminus X_U$ and $I_s(y) = 0$ otherwise. A
higher safety score for the current state of the robot means it
is less likely for the robot to reach an unsafe region in the
future.
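
The safety score (10) can be sketched on a square grid as follows. The sketch assumes that each cell connects only to its four axis-aligned neighbors, so that the minimum number of transitions between two cells is their Manhattan distance; this adjacency, the `(row, column)` cell encoding, and the function names are assumptions of the example rather than details given in the text.

```python
def neighborhood(cell, size, radius):
    """Cells within `radius` transitions of `cell` on a size x size 4-connected grid."""
    r0, c0 = cell
    return [(r, c) for r in range(size) for c in range(size)
            if abs(r - r0) + abs(c - c0) <= radius]

def safety_score(cell, unsafe, size, radius):
    """Fraction of cells in the neighborhood of `cell` that are not unsafe."""
    nbrs = neighborhood(cell, size, radius)
    safe = sum(1 for y in nbrs if y not in unsafe)
    return safe / len(nbrs)

# A 5 x 5 grid with a single unsafe cell next to the center.
score_center = safety_score((2, 2), unsafe={(2, 3)}, size=5, radius=1)
score_corner = safety_score((0, 0), unsafe={(0, 1)}, size=5, radius=1)
```

Note that the neighborhood includes the cell itself, since its distance to itself is zero.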

We define the progress score of a state $x \in X$ as
$d_g(x) := \min_{y \in X_G} d(x, y)$, which is the minimum number
of transitions from $x$ to any goal region. We can now propose
the RSP, which is a Boltzmann policy as defined in
(1). Note that $U = \{u_1, u_2, u_3, u_4\}$, which corresponds to
"North", "East", "South", and "West", respectively. We first
define

\[ a_i(\boldsymbol\theta) = F_i(x)\, e^{\theta_1 E\{s(f(x, u_i))\} + \theta_2 E\{d_g(f(x, u_i)) - d_g(x)\}}, \tag{11} \]

where $\boldsymbol\theta := (\theta_1, \theta_2)$, and $F_i(x)$ is an indicator function such
that $F_i(x) = 1$ if $u_i$ is available at $x$ and $F_i(x) = 0$
otherwise. Note that the availability of control actions at a
state is limited for the states at the boundary. For example,
at the initial state, which is at the lower-left corner, the set of
available actions is $\{u_1, u_2\}$, corresponding to "North" and
"East", respectively. If an action $u_i$ is not available at state
$x$, we set $a_i(\boldsymbol\theta) = 0$, which means that $\mu_{\boldsymbol\theta}(u_i \mid x) = 0$.

Note that $a_i(\boldsymbol\theta)$ is defined to be the combination of the
expected safety score of the next state applying control $u_i$,
and the expected improved progress score from the current
state applying $u_i$, weighted by $\theta_1$ and $\theta_2$. The RSP is then
given by

\[ \mu_{\boldsymbol\theta}(u_i \mid x) = \frac{a_i(\boldsymbol\theta)}{\sum_{j=1}^{4} a_j(\boldsymbol\theta)}. \tag{12} \]

We note that Assumption B holds for the proposed RSP.
Moreover, Assumption A also holds, therefore Theorem III.1
holds for this RSP.
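
The policy in (11)-(12) can be sketched as below. The sketch assumes that the two expectations for each action (expected safety score of the next state and expected change in distance to the goal) have already been computed from a transition model; the function name `rsp`, its argument names, and the numerical inputs are hypothetical.

```python
import math

def rsp(theta, exp_safety, exp_progress, available):
    """Action probabilities mu_theta(u_i | x) built from the weights a_i(theta).

    exp_safety[i]   : expected safety score of the next state under u_i
    exp_progress[i] : expected change in distance to the goal under u_i
    available[i]    : True if u_i is enabled at the current state
    """
    a = [math.exp(theta[0] * s + theta[1] * d) if ok else 0.0
         for s, d, ok in zip(exp_safety, exp_progress, available)]
    total = sum(a)
    return [ai / total for ai in a]

# Lower-left corner: only "North" and "East" are enabled.
mu = rsp(theta=[1.0, -1.0],
         exp_safety=[1.0, 1.0, 0.5, 0.5],
         exp_progress=[-1.0, 1.0, -1.0, 1.0],
         available=[True, True, False, False])
```

With a negative second parameter, actions that reduce the expected distance to the goal receive larger weight, which matches the sign of the initial parameter used in the results.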

B. Generating transition probabilities 

To implement the LSTD Actor-Critic algorithm, we first 
constructed the MDP. As mentioned above, this MDP repre- 
sents the motion of the robot in the environment where each 
state corresponds to a cell in the environment (Fig. 1). To
capture the transition probabilities of the robot from a cell 
to its adjacent one under an action, we built a simulator. 

The simulator uses a unicycle model (see, e.g., [19]) for 
the dynamics of the robot with noisy sensors and actuators. 
In this model, the motion of the robot is determined by spec- 
ifying a forward and an angular velocity. At a given region, 
the robot implements one of the following four controllers 
(motion primitives) - "East", "North", "West", "South". Each 
of these controllers operates by obtaining the difference 



between the current heading angle and the desired heading 
angle. Then, it is translated into a proportional feedback 
control law for angular velocity. The desired heading angles 
for the "East", "North", "West", and "South" controllers are 
0°, 90°, 180°, and 270°, respectively. Each controller also 
uses a constant forward velocity. The environment in the
simulator is a 50 by 50 square grid as shown in Fig. 1. To
each cell of the environment, we randomly assigned a surface
roughness which affects the motion of the robot in that cell.
The perimeter of the environment is made of walls, and when
the robot runs into them, it bounces off at the mirror angle.

To find the transition probabilities, we performed a total of 
5000 simulations for each controller and state of the MDP. In 
each trial, the robot was initialized at the center of the cell, 
and then an action was applied. The robot moved in that 
cell according to its dynamics and surface roughness of the 
region. As soon as the robot exited the cell, a transition was 
encountered. Then, a reliable center-converging controller 
was automatically applied to steer the robot to the center 
of the new cell. We assumed that the center-converging
controller is reliable enough that it always drives the robot
to the center of the new cell before exiting it. Thus, the
robot always started from the center of a cell. This makes
the process Markov (the probability of the current transition
depends only on the control and the current state, and not on
the history up to the current state). We also assumed perfect
observation at the boundaries of the cells.
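
Estimating transition probabilities from repeated trials amounts to counting outcomes. The sketch below illustrates only the counting step: `toy_simulator` is a hypothetical stand-in for the unicycle simulator (it moves in the intended direction with probability 0.8 and slips sideways otherwise), and the cell coordinates and seed are arbitrary.

```python
import random
from collections import Counter

def estimate_transitions(simulate, cell, action, trials, rng):
    """Estimate p(. | cell, action) as empirical frequencies over repeated trials."""
    counts = Counter(simulate(cell, action, rng) for _ in range(trials))
    return {nxt: c / trials for nxt, c in counts.items()}

def toy_simulator(cell, action, rng):
    """Hypothetical noisy motion model, NOT the simulator described in the text."""
    r, c = cell
    dr, dc = {"North": (1, 0), "East": (0, 1), "South": (-1, 0), "West": (0, -1)}[action]
    if rng.random() < 0.8:
        return (r + dr, c + dc)   # intended move
    return (r + dc, c + dr)       # sideways slip

rng = random.Random(0)
p_hat = estimate_transitions(toy_simulator, (5, 5), "North", 5000, rng)
```

Each trial starts from the same cell, mirroring the assumption that the robot is re-centered before the next action is applied.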

It should be noted that, in general, it is not required to 
have all the transition probabilities of the model in order 
to apply the LSTD Actor-Critic algorithm, but rather, we 
only need transition probabilities along the trajectories of the 
system simulated while running the algorithm. This becomes 
an important advantage in the case where the environment 
is large and obtaining all transition probabilities becomes 
infeasible. 

C. Results 

We first obtained the exact optimal policy for this prob- 
lem using the methods described in [2], [5]. The maximal 
reachability probability is 99.9988%. We then used our 
LSTD actor-critic algorithm to optimize with respect to $\boldsymbol\theta$
as outlined in Sec. III and IV.

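Given $\boldsymbol\theta$, we can compute the exact probability of reaching
$X_G$ from any state $x \in X$ applying the RSP $\mu_{\boldsymbol\theta}$ by solving
the following set of linear equations

\[ p_{\boldsymbol\theta}(x) = \sum_{u \in U} \sum_{y \in X} \mu_{\boldsymbol\theta}(u \mid x)\, p(y \mid x, u)\, p_{\boldsymbol\theta}(y), \quad \text{for all } x \in X \setminus (X_U \cup X_G), \tag{13} \]

such that $p_{\boldsymbol\theta}(x) = 0$ if $x \in X_U$ and $p_{\boldsymbol\theta}(x) = 1$ if $x \in X_G$.
Note that the equation system given by (13) contains exactly
$|X| - |X_U| - |X_G|$ equations and unknowns.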
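
The linear system (13) can be solved directly; the sketch below instead uses repeated substitution, a swapped-in iterative method that converges to the same reachability probabilities when started from zero. The data layout (`policy[x]` mapping actions to probabilities, `p[(x, u)]` mapping next states to probabilities), the function name, and the four-state chain are assumptions of the example.

```python
def reach_probability(states, goal, unsafe, policy, p, sweeps=10_000, tol=1e-12):
    """Approximate the solution of (13) by in-place repeated substitution."""
    prob = {x: (1.0 if x in goal else 0.0) for x in states}
    for _ in range(sweeps):
        delta = 0.0
        for x in states:
            if x in goal or x in unsafe:
                continue  # boundary values stay fixed at 1 and 0
            new = sum(mu * pr * prob[y]
                      for u, mu in policy[x].items()
                      for y, pr in p[(x, u)].items())
            delta = max(delta, abs(new - prob[x]))
            prob[x] = new
        if delta < tol:
            break
    return prob

# Chain 0 - 1 - 2 - 3 with unsafe state 0, goal state 3, and a fair left/right policy.
states = [0, 1, 2, 3]
policy = {x: {"L": 0.5, "R": 0.5} for x in (1, 2)}
p = {(x, "L"): {x - 1: 1.0} for x in (1, 2)}
p.update({(x, "R"): {x + 1: 1.0} for x in (1, 2)})
prob = reach_probability(states, goal={3}, unsafe={0}, policy=policy, p=p)
```

For this chain the two unknowns satisfy $p(1) = \tfrac{1}{2}p(2)$ and $p(2) = \tfrac{1}{2}p(1) + \tfrac{1}{2}$, whose solution is $p(1) = 1/3$ and $p(2) = 2/3$.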

We plotted in Fig. 2 the reachability probability of the
RSP from the initial state (i.e., $p_{\boldsymbol\theta}(x_0)$) against the number
of iterations in the actor-critic algorithm each time $\boldsymbol\theta$
is updated. As $\boldsymbol\theta$ converges, the reachability probability
converges to 90.3%. The parameters for this example are:
$r_n = 2$, $\lambda = 0.9$, $D = 5$ and the initial $\boldsymbol\theta$ is $(50, -10)$. We
use (8) for $\beta_k$ with $c = 0.05$.

Fig. 2. The dashed line represents the optimal solution (the maximal
reachability probability) and the solid line represents the exact reachability
probability for the RSP as a function of the number of iterations applying
the proposed algorithm.

VI. Conclusion 

We considered the problem of finding a control policy for a 
Markov Decision Process (MDP) to maximize the probability 
of reaching some states of the MDP while avoiding some 
other states. We presented a transformation of the problem 
into a Stochastic Shortest Path (SSP) MDP and developed a 
new approximate dynamic programming algorithm to solve 
this class of problems. The algorithm operates on a sample- 
path of the system and optimizes the policy within a pre- 
specified class parameterized by a parsimonious set of pa- 
rameters. Simulation results confirm the effectiveness of the 
proposed solution in robot motion planning applications. 

Appendix: Convergence of the LSTD Actor-Critic Algorithm

We first cite the theory of linear stochastic approximation 
driven by a slowly varying Markov chain [13] (with simpli- 
fications). 

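Let $\{y_k\}$ be a finite Markov chain whose transition
probabilities depend on a parameter $\boldsymbol\theta \in \mathbb{R}^n$. Consider a
generic iteration of the form

\[ \mathbf{s}_{k+1} = \mathbf{s}_k + \gamma_k \left( \mathbf{h}_{\boldsymbol\theta_k}(y_{k+1}) - \mathbf{G}_{\boldsymbol\theta_k}(y_{k+1})\mathbf{s}_k \right) + \gamma_k \boldsymbol\Xi_k \mathbf{s}_k, \tag{14} \]

where $\mathbf{s}_k \in \mathbb{R}^m$, and $\mathbf{h}_{\boldsymbol\theta}(\cdot) \in \mathbb{R}^m$, $\mathbf{G}_{\boldsymbol\theta}(\cdot) \in \mathbb{R}^{m \times m}$ are $\boldsymbol\theta$-parameterized
vector and matrix functions, respectively. It
has been shown in [13] that the critic in (14) converges if
the following set of conditions are met.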



Condition 1 

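1) The sequence $\{\gamma_k\}$ is deterministic, non-increasing,
and

\[ \sum_k \gamma_k = \infty, \quad \sum_k \gamma_k^2 < \infty. \]

2) The random sequence $\{\boldsymbol\theta_k\}$ satisfies $\|\boldsymbol\theta_{k+1} - \boldsymbol\theta_k\| \le \beta_k H_k$
for some process $\{H_k\}$ with bounded moments,
where $\{\beta_k\}$ is a deterministic sequence such that

\[ \sum_k \left( \frac{\beta_k}{\gamma_k} \right)^d < \infty \quad \text{for some } d > 0. \]

3) $\boldsymbol\Xi_k$ is an $m \times m$-matrix valued martingale difference
with bounded moments.

4) For each $\boldsymbol\theta$, there exist $\bar{\mathbf{h}}(\boldsymbol\theta) \in \mathbb{R}^m$, $\bar{\mathbf{G}}(\boldsymbol\theta) \in \mathbb{R}^{m \times m}$,
and corresponding $m$-vector and $m \times m$-matrix functions
$\hat{\mathbf{h}}_{\boldsymbol\theta}(\cdot)$, $\hat{\mathbf{G}}_{\boldsymbol\theta}(\cdot)$ that satisfy the Poisson equation.
That is, for each $y$,

\[ \hat{\mathbf{h}}_{\boldsymbol\theta}(y) = \mathbf{h}_{\boldsymbol\theta}(y) - \bar{\mathbf{h}}(\boldsymbol\theta) + (P_{\boldsymbol\theta}\hat{\mathbf{h}}_{\boldsymbol\theta})(y), \]

\[ \hat{\mathbf{G}}_{\boldsymbol\theta}(y) = \mathbf{G}_{\boldsymbol\theta}(y) - \bar{\mathbf{G}}(\boldsymbol\theta) + (P_{\boldsymbol\theta}\hat{\mathbf{G}}_{\boldsymbol\theta})(y). \]

5) For some constant $C$ and for all $\boldsymbol\theta$, we have
$\max(\|\bar{\mathbf{h}}(\boldsymbol\theta)\|, \|\bar{\mathbf{G}}(\boldsymbol\theta)\|) \le C$.

6) For any $d > 0$, there exists $C_d > 0$ such that
$\sup_k E[\|\mathbf{f}_{\boldsymbol\theta_k}(y_k)\|^d] \le C_d$, where $\mathbf{f}_{\boldsymbol\theta}(\cdot)$ represents any
of the functions $\hat{\mathbf{h}}_{\boldsymbol\theta}(\cdot)$, $\mathbf{h}_{\boldsymbol\theta}(\cdot)$, $\hat{\mathbf{G}}_{\boldsymbol\theta}(\cdot)$ and $\mathbf{G}_{\boldsymbol\theta}(\cdot)$.

7) For some constant $C > 0$ and for all $\boldsymbol\theta, \bar{\boldsymbol\theta} \in \mathbb{R}^n$,
$\max(\|\bar{\mathbf{h}}(\boldsymbol\theta) - \bar{\mathbf{h}}(\bar{\boldsymbol\theta})\|, \|\bar{\mathbf{G}}(\boldsymbol\theta) - \bar{\mathbf{G}}(\bar{\boldsymbol\theta})\|) \le C\|\boldsymbol\theta - \bar{\boldsymbol\theta}\|$.

8) There exists a positive measurable function $C(\cdot)$ such
that for every $d > 0$, $\sup_k E[C(y_k)^d] < \infty$, and
$\|\mathbf{f}_{\boldsymbol\theta}(y) - \mathbf{f}_{\bar{\boldsymbol\theta}}(y)\| \le C(y)\|\boldsymbol\theta - \bar{\boldsymbol\theta}\|$.

9) There exists $a > 0$ such that for all $\mathbf{s} \in \mathbb{R}^m$ and
$\boldsymbol\theta \in \mathbb{R}^n$

\[ \mathbf{s}'\bar{\mathbf{G}}(\boldsymbol\theta)\mathbf{s} \ge a\|\mathbf{s}\|^2. \]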

For now, let's focus on the first two items of Condition 1.
Recall that for any matrix $\mathbf{A}$, $v(\mathbf{A})$ is a column vector that
stacks all row vectors of $\mathbf{A}$ (also written as column vectors).
Simple algebra suggests that the core iteration of the LSTD
critic can be written as (14) with

b fc 

s,« = v(A fc ) 
1 

g(x,u)z 

My) = v(z((P e Ve)(x, u) - ip' e (x, u))) 

1 

Ge(y) = [ I ] , 

D 



y fe = (xfc,Ufc,Zfc), 



(15) 













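where

\[ \mathbf{D} = v\left( \mathbf{z}_k \left( \boldsymbol\psi'_{\boldsymbol\theta_k}(x_{k+1}, u_{k+1}) - (P_{\boldsymbol\theta_k}\boldsymbol\psi_{\boldsymbol\theta_k})'(x_k, u_k) \right) \right), \]

and $M$ is an arbitrary (large) positive constant whose role is
to facilitate the convergence proof, and $y = (x, u, \mathbf{z})$ denotes
a value of the triplet $y_k$.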



The step-sizes $\gamma_k$ and $\beta_k$ in (4) and (6) correspond exactly
to the $\gamma_k$ and $\beta_k$ in Condition 1(1) and 1(2), respectively. If
the MDP has finite state and action space, then the conditions
on $\{\beta_k\}$ reduce to ([13])

\[ \sum_k \beta_k = \infty, \quad \sum_k \beta_k^2 < \infty, \quad \lim_{k \to \infty} \frac{\beta_k}{\gamma_k} = 0, \tag{16} \]

where $\{\beta_k\}$ is a deterministic and non-increasing sequence.
Note that we can use $\gamma_k = 1/k$ (cf. Condition 1). The
following theorem establishes the convergence of the critic.

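Theorem VI.1 [Critic Convergence] For the LSTD actor-critic
(4) and (6) with some step-size sequence $\{\beta_k\}$ satisfying
(16), the sequence $\mathbf{s}_k$ is bounded, and

\[ \lim_{k \to \infty} \left| \bar{\mathbf{G}}(\boldsymbol\theta_k)\mathbf{s}_k - \bar{\mathbf{h}}(\boldsymbol\theta_k) \right| = 0. \tag{17} \]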



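Proof: To show that (14) converges with
$\mathbf{s}$, $y$, $\mathbf{h}_{\boldsymbol\theta}(\cdot)$, $\mathbf{G}_{\boldsymbol\theta}(\cdot)$ and $\boldsymbol\Xi$ substituted by (15), the conditions
1(1)-(9) should be checked. However, a comparison with
the convergence proof for the TD($\lambda$) critic in [11] gives a
simpler proof. Let

\[ \mathbf{F}_{\boldsymbol\theta}(y) = \mathbf{z}\left( \boldsymbol\psi'_{\boldsymbol\theta}(x, u) - (P_{\boldsymbol\theta}\boldsymbol\psi_{\boldsymbol\theta})'(x, u) \right). \]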

While proving the convergence of the TD($\lambda$) critic operating
concurrently with the actor, [11] showed that
. h< 2) (y) _




g(x,u)z 



G e (y) 
Fe(y)



and 








satisfy Condition 1(3)-(8). In our case, (15) can be rewritten as



he(y) 



h^(y) 



(2), 

My) 
1 



G e (y) 



I , Et 



(18) 

Note that although the two iterates are very different, they
involve the same quantities and both in a linear fashion.
So, $\mathbf{h}_{\boldsymbol\theta}(\cdot)$, $\mathbf{G}_{\boldsymbol\theta}(\cdot)$ and $\boldsymbol\Xi_k$ also satisfy conditions 1(3)-(8).
Meanwhile, the step-size $\{\gamma_k\}$ satisfies condition 1(1), and
the step-size $\{\beta_k\}$ satisfies Eq. (16) (which, as explained
above, implies condition 1(2)). Now, only condition 1(9)
remains to be checked. To that end, note that all diagonal
elements of $\mathbf{G}_{\boldsymbol\theta}(y)$ are equal to one, so $\mathbf{G}_{\boldsymbol\theta}(y)$ is positive definite.
This proves the convergence. Using the same correspondence
and the result in [11], one can further check that (17) also
holds here. ■



Proof of Theorem III.1

The result follows by setting $\boldsymbol\phi_{\boldsymbol\theta} = \boldsymbol\psi_{\boldsymbol\theta}$ and following the
proof in Section 6 of [11].